Skip to content

Modify LTimes to match Kripke LTimes#684

Open
michaelmckinsey1 wants to merge 7 commits into
developfrom
ltimes-block25
Open

Modify LTimes to match Kripke LTimes#684
michaelmckinsey1 wants to merge 7 commits into
developfrom
ltimes-block25

Conversation

@michaelmckinsey1

@michaelmckinsey1 michaelmckinsey1 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Matching parameters like zones, groups, moments, and directions, the runtime of RAJAPerf LTimes is equivalent (<1%) to Kripke LTimes on CPU. But I have noticed on GPU, runtime varies by ~15% on CUDA and ~60% on ROCm. After these changes CUDA runtime is within 1% and ROCm within 6% (CUDA is now faster, ROCm actually slower with block 25). This is because AMD wavefront size 64 on block 25 wastes significantly more threads than NVIDIA warp size 32.

Summary

  • This PR is a refactoring
  • It does the following (modify list as needed):
    • Modifies ell layout from d contiguous to m contiguous to make accesses coalesced. This matches Kripke ZGD, where Field_Ell is layout ordering moment as stride-1. Doesn't apply to psi because does not depend on m (threads access same value), phi because no d (already contiguous in m).
      • This change made ~5% runtime difference on CUDA
    • Changes default block size to be m instead of 256 and remapped the kernel so z -> blockIdx.x, g -> blockIdx.y, m -> threadIdx.x. This changes the launch to grid=(num_z, num_g, 1), block=(m, 1, 1), which matches the kripke launch. However, running at different legendre orders (m) will result in different block sizes.
    • Would need to make similar changes for LTIMES-NOVIEW? done

… inner loop. Grid/block size changes to match Kripke Ltimessdom
@michaelmckinsey1 michaelmckinsey1 self-assigned this Jun 10, 2026
@michaelmckinsey1 michaelmckinsey1 marked this pull request as ready for review June 17, 2026 21:37
@michaelmckinsey1 michaelmckinsey1 changed the title [WIP] Possible LTimes Changes Modify LTimes to match Kripke LTimes Jun 17, 2026
@MrBurmark

Copy link
Copy Markdown
Member

Yes you would have to make LTIMES and LTIMES_NO_VIEW match.

Comment thread src/apps/LTIMES.hpp Outdated
Comment thread src/apps/LTIMES.hpp
@michaelmckinsey1

Copy link
Copy Markdown
Contributor Author

Yes you would have to make LTIMES and LTIMES_NO_VIEW match.

Done

Comment thread src/apps/LTIMES-Cuda.cpp
RAJA::statement::CudaKernelAsync<
RAJA::statement::For<1, RAJA::cuda_block_x_loop, // z
RAJA::statement::For<2, RAJA::cuda_block_y_loop, // g
RAJA::statement::For<3, RAJA::cuda_thread_x_loop, // m

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is a non-size loop policy because it is in ltimes. Here we know the block size at compile time, is that also true in kripke?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Kripke, the block sizes are determined by parameters passed in at runtime. For example, like in this version of LTimes, it will be blocked on zones and groups. The exact parameters which will be blocked are not always the same for each Kripke run. For instance, if we use the DZG layout at runtime, then the loops will be blocked with directions and zones (while groups are threaded).

@MrBurmark MrBurmark requested a review from rchen20 June 19, 2026 21:53
static const size_t default_gpu_block_size = 256;
using gpu_block_sizes_type = integer::make_gpu_block_size_list_type<default_gpu_block_size,
integer::MultipleOf<32>>;
static const size_t default_gpu_block_size = 25;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was a block size of 25 chosen? Would it be better for both GPU platforms to set this to 32?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the square of legendre order + 1. In kripke this is also the case, the default legendre is 4, so (4+1)^2 means the kernel in kripke will be (5,5,1)

In RAJAPerf, we set m=25 directly (and this is the default). So m=36 would then be equivalent to legendre=5 in kripke.

Comment thread src/apps/LTIMES-Cuda.cpp
RAJA::statement::For<1, RAJA::cuda_global_size_z_direct<z_block_sz>, //z
RAJA::statement::For<2, RAJA::cuda_global_size_y_direct<g_block_sz>, //g
RAJA::statement::For<3, RAJA::cuda_global_size_x_direct<m_block_sz>, //m
RAJA::statement::CudaKernelAsync<

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I've reverted LTimes to launch synchronously in Kripke, for correctness. This is fine though because the direction loop is inner-most, which should avoid race conditions.

@michaelmckinsey1 michaelmckinsey1 Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been async in RAJAPerf, I just changed it from CudaKernelFixedAsync to CudaKernelAsync, but for completeness I can make it CudaKernel. I don't this would matter for performance in RAJAPerf.

@MrBurmark

Copy link
Copy Markdown
Member

Do we want to only have this kripke conforming tuning, or should does it make sense to keep the current tuning as well?

@artv3

artv3 commented Jun 22, 2026

Copy link
Copy Markdown
Member

@michaelmckinsey1 take a look here: https://github.com/llnl/RAJA/blob/develop/benchmark/ltimes.cpp, it would be cool to also have a GPU shared memory version as a tuning!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants